relative loss
Group Relative Knowledge Distillation: Learning from Teacher's Relational Inductive Bias
Li, Chao, Zhou, Changhua, Chen, Jia
Knowledge distillation typically transfers knowledge from a teacher model to a student model by minimizing differences between their output distributions. However, existing distillation approaches largely focus on mimicking absolute probabilities and neglect the valuable relational inductive biases embedded in the teacher's relative predictions, leading to exposure bias. In this paper, we propose Group Relative Knowledge Distillation (GRKD), a novel framework that distills teacher knowledge by learning the relative ranking among classes, rather than directly fitting the absolute distribution. Specifically, we introduce a group relative loss that encourages the student model to preserve the pairwise preference orderings provided by the teacher's outputs. Extensive experiments on classification benchmarks demonstrate that GRKD achieves superior generalization compared to existing methods, especially in tasks requiring fine-grained class differentiation. Our method provides a new perspective on exploiting teacher knowledge, focusing on relational structure rather than absolute likelihood.
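The pairwise-preference idea behind GRKD can be illustrated with a toy surrogate loss (a hypothetical sketch, not the paper's exact objective: the function name, margin parameter, and averaging scheme are all assumptions):

```python
def group_relative_loss(teacher_logits, student_logits, margin=0.1):
    """Toy pairwise ranking surrogate: penalize the student whenever its
    margin between two classes disagrees with the teacher's preference
    ordering, instead of matching absolute probabilities."""
    loss, pairs = 0.0, 0
    k = len(teacher_logits)
    for i in range(k):
        for j in range(k):
            if teacher_logits[i] > teacher_logits[j]:
                # Teacher prefers class i over class j: ask the student to
                # separate them by at least `margin`.
                loss += max(0.0, margin - (student_logits[i] - student_logits[j]))
                pairs += 1
    return loss / max(pairs, 1)
```

A student whose logits preserve the teacher's ordering with comfortable margins incurs zero loss, regardless of how its absolute values differ from the teacher's.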
The Surprising Benefits of Base Rate Neglect in Robust Aggregation
Kong, Yuqing, Wang, Shu, Wang, Ying
Robust aggregation integrates predictions from multiple experts without knowledge of the experts' information structures. Prior work assumes experts are Bayesian, providing predictions as perfect posteriors based on their signals. However, real-world experts often deviate systematically from Bayesian reasoning. Our work considers experts who tend to ignore the base rate. We find that a certain degree of base rate neglect helps with robust forecast aggregation. Specifically, we consider a forecast aggregation problem with two experts who each predict a binary world state after observing private signals. Unlike previous work, we model experts exhibiting base rate neglect, where they incorporate the base rate information to degree $\lambda\in[0,1]$, with $\lambda=0$ indicating complete ignorance and $\lambda=1$ perfect Bayesian updating. To evaluate aggregators' performance, we adopt Arieli et al. (2018)'s worst-case regret model, which measures the maximum regret across the set of considered information structures compared to an omniscient benchmark. Our results reveal a surprising V-shape of regret as a function of $\lambda$. That is, predictions that incorporate the base rate to an intermediate degree $\lambda<1$ can, counter-intuitively, lead to lower regret than perfect Bayesian posteriors with $\lambda=1$. We additionally propose a new aggregator that achieves low regret and is robust to unknown $\lambda$. Finally, we conduct an empirical study to test the base rate neglect model and evaluate the performance of various aggregators.
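The degree-$\lambda$ base rate neglect model can be sketched as a discounted Bayesian update (a minimal illustration assuming the standard formulation in which the prior enters the posterior raised to the power $\lambda$; the function name is hypothetical):

```python
def brn_posterior(likelihood1, likelihood0, prior1, lam):
    """Probability of state 1 when the base rate is incorporated to degree
    lam: posterior proportional to likelihood * prior**lam.
    lam=1 recovers Bayesian updating; lam=0 ignores the base rate entirely."""
    w1 = likelihood1 * prior1 ** lam
    w0 = likelihood0 * (1.0 - prior1) ** lam
    return w1 / (w1 + w0)
```

At $\lambda=0$ the prediction reduces to the normalized likelihood ratio, and at $\lambda=1$ it equals the exact Bayesian posterior.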
Information loss from dimensionality reduction in 5D-Gaussian spectral data
Understanding the loss of information in spectral analytics is a crucial first step towards finding root causes for failures and uncertainties when spectral data are used in artificial intelligence models built from modern complex data science applications. Here, we show from an elementary Shannon entropy model analysis with quantum statistics of Gaussian-distributed spectral data that the relative loss of information from dimensionality reduction, due to the projection of an initial five-dimensional dataset onto two-dimensional diagrams, is less than one percent in the parameter range of small datasets with sample sizes on the order of a few hundred data samples. From our analysis, we also conclude that the density and expectation value of the entropy probability distribution increase with the sample number and sample size, using artificial data models derived from random-sampling Monte Carlo simulation methods.
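The kind of sample-based entropy estimate underlying such an analysis can be sketched with a simple histogram plug-in estimator (a generic toy, not the paper's quantum-statistics model; the function name and bin count are assumptions):

```python
import math

def shannon_entropy(samples, bins=20):
    """Plug-in Shannon entropy (in nats) of 1-D samples, estimated from a
    histogram of the empirical distribution."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0   # degenerate data collapses to one bin
    counts = [0] * bins
    for x in samples:
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(samples)
    return -sum(c / n * math.log(c / n) for c in counts if c)
```

Applying such an estimator coordinate-wise before and after a projection gives one crude way to quantify the relative information retained by a lower-dimensional representation.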
Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming
Sun, Chunlin, Liu, Shang, Li, Xiaocheng
The predict-then-optimize problem considers a learning problem under a decision making context where the output of a machine learning model serves as the input of a downstream optimization problem (e.g. a linear program). The ultimate goal of the learner is to prescribe a decision/solution for the downstream optimization problem using directly the input (variables) of the machine learning model but without full observation of the input of the optimization problem. A similar problem formulation was also studied as prescriptive analytics (Bertsimas and Kallus, 2020) and contextual linear programming (Hu et al., 2022). While Elmachtoub and Grigas (2022) justifies the importance of leveraging the optimization problem structure when building the machine learning model, the existing efforts on exploiting the optimization structure have been largely inadequate. In this paper, we delve deeper into the structural properties of the optimization problem and propose a new approach called maximum optimality margin which builds a max-margin learning model based on the optimality condition of the downstream optimization problem. More importantly, our approach only needs the observations of the optimal solution in the training data rather than the objective function, thus it draws an interesting connection to the inverse optimization problem. The connection gives a new shared perspective on both the predict-then-optimize problem and the inverse optimization problem, and our analysis reveals a scale inconsistency issue that arises practically and theoretically for many existing methods.
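The optimality-margin idea can be illustrated on the simplest LP, minimizing $c \cdot x$ over the probability simplex, whose optimum is the vertex with the smallest cost; the optimality condition is then just $c_{j^*} \le c_j$ for all $j$. The hinge-style penalty below is a hypothetical toy in this spirit, not the paper's general construction:

```python
def optimality_margin_loss(pred_costs, opt_index, margin=1.0):
    """Toy optimality-margin loss for min c.x over the simplex: penalize the
    predicted cost vector whenever the observed optimal vertex `opt_index`
    fails to beat every other vertex by at least `margin`. Note that only the
    optimal solution (the index), not the true objective, is needed."""
    c_star = pred_costs[opt_index]
    return sum(max(0.0, margin - (c - c_star))
               for j, c in enumerate(pred_costs) if j != opt_index)
```

Training a cost-prediction model against such a loss uses only observed optimal solutions, which is the connection to inverse optimization mentioned above.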
Linear Hinge Loss and Average Margin
We describe a unifying method for proving relative loss bounds for on-line linear threshold classification algorithms, such as the Perceptron and the Winnow algorithms. For classification problems the discrete loss is used, i.e., the total number of prediction mistakes. We introduce a continuous loss function, called the "linear hinge loss", that can be employed to derive the updates of the algorithms. We first prove bounds w.r.t. the linear hinge loss and then convert them to the discrete loss. We show how relative loss bounds based on the linear hinge loss can be converted to relative loss bounds in terms of the discrete loss using the average margin.
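The relation between the discrete loss and a hinge-style surrogate can be sketched with a plain Perceptron (a simplified illustration: the surrogate used here is `max(0, -y * w.x)`, which stands in for, but is not identical to, the paper's linear hinge loss):

```python
def perceptron_train(examples, epochs=1):
    """Perceptron on (x, y) pairs with y in {-1, +1}. Tracks both the
    discrete loss (number of mistakes) and a continuous hinge-style
    surrogate max(0, -y * w.x) accumulated over the run."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    mistakes, hinge_total = 0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            hinge_total += max(0.0, -y * score)   # continuous surrogate
            if y * score <= 0:                    # discrete loss: a mistake
                mistakes += 1
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes, hinge_total
```

On linearly separable data the mistake count stops growing after a finite number of updates, which is exactly the quantity the discrete-loss bounds control.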
Decentralized Feature-Distributed Optimization for Generalized Linear Models
Ancelin, Brighton, Bahmani, Sohail, Romberg, Justin
We consider the "all-for-one" decentralized learning problem for generalized linear models. The features of each sample are partitioned among several collaborating agents in a connected network, but only one agent observes the response variables. To solve the regularized empirical risk minimization in this distributed setting, we apply the Chambolle--Pock primal--dual algorithm to an equivalent saddle-point formulation of the problem. The primal and dual iterations are either in closed-form or reduce to coordinate-wise minimization of scalar convex functions. We establish convergence rates for the empirical risk minimization under two different assumptions on the loss function (Lipschitz and square root Lipschitz), and show how they depend on the characteristics of the design matrix and the Laplacian of the network.
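The Chambolle--Pock primal--dual iteration can be sketched on the simplest instance, least squares written in saddle-point form $\min_x \max_y \langle Ax, y\rangle - (\tfrac{1}{2}\|y\|^2 + \langle b, y\rangle)$ (a generic textbook illustration, not the paper's feature-distributed variant; step sizes must satisfy $\tau\sigma\|A\|^2 \le 1$):

```python
def chambolle_pock(A, b, tau, sigma, iters=2000):
    """Chambolle--Pock iterations for the saddle-point form of least squares
    min_x 0.5 * ||Ax - b||^2. Both the dual prox (closed form) and the primal
    prox (identity, since g = 0) are cheap, as in the paper's setting."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    xbar = x[:]
    y = [0.0] * m
    for _ in range(iters):
        Axbar = [sum(A[i][j] * xbar[j] for j in range(n)) for i in range(m)]
        # Dual step: prox of f*(y) = 0.5||y||^2 + <b, y> in closed form.
        y = [(y[i] + sigma * Axbar[i] - sigma * b[i]) / (1 + sigma) for i in range(m)]
        Aty = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
        x_new = [x[j] - tau * Aty[j] for j in range(n)]   # primal step (g = 0)
        xbar = [2 * x_new[j] - x[j] for j in range(n)]    # extrapolation
        x = x_new
    return x
```

On a well-conditioned system the iterates converge to the least-squares solution; in the decentralized setting the same primal and dual steps are split across agents.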
Stochastic Bandits with Vector Losses: Minimizing $\ell^\infty$-Norm of Relative Losses
Shang, Xuedong, Shao, Han, Qian, Jian
Multi-armed bandits are widely applied in scenarios like recommender systems, where the goal is to maximize the click rate. However, more factors should be considered, e.g., user stickiness, user growth rate, user experience assessment, etc. In this paper, we model this situation as a $K$-armed bandit problem with multiple losses. We define the relative loss vector of an arm, where the $i$-th entry compares the arm with the optimal arm with respect to the $i$-th loss. We study two goals: (a) finding the arm with the minimum $\ell^\infty$-norm of relative losses at a given confidence level (fixed-confidence best-arm identification); (b) minimizing the $\ell^\infty$-norm of cumulative relative losses (regret minimization). For goal (a), we derive a problem-dependent sample complexity lower bound and discuss how to achieve matching algorithms. For goal (b), we provide a regret lower bound of $\Omega(T^{2/3})$ and a matching algorithm.
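With the true losses in hand, the objective this abstract describes reduces to a small computation (a sketch of the objective only, not of the bandit algorithms; the function name is hypothetical):

```python
def linf_relative_losses(loss_matrix):
    """loss_matrix[k][i] is the i-th loss of arm k. The i-th entry of arm k's
    relative loss vector compares it with the best arm for loss i; return the
    per-arm l-infinity norms and the arm minimizing that norm."""
    n_losses = len(loss_matrix[0])
    best = [min(row[i] for row in loss_matrix) for i in range(n_losses)]
    norms = [max(row[i] - best[i] for i in range(n_losses)) for row in loss_matrix]
    return norms, norms.index(min(norms))
```

Note that the arm minimizing the $\ell^\infty$-norm of relative losses need not be optimal for any single loss, which is what makes the identification problem non-trivial.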
Relative Loss Bounds for On-line Density Estimation with the Exponential Family of Distributions
Azoury, Katy S., Warmuth, Manfred K.
We consider on-line density estimation with a parameterized density from the exponential family. The on-line algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss which is the negative log-likelihood of the example w.r.t. the past parameter of the algorithm. An off-line algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the on-line algorithm over the total loss of the off-line algorithm. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a certain divergence to derive and analyze the algorithms. This divergence is a relative entropy between two exponential distributions.
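The protocol can be sketched for the simplest member of the exponential family, a unit-variance Gaussian, where the running mean of past examples is the natural on-line parameter (a toy illustration of the setting only; the paper's algorithms and divergence-based analysis are more general):

```python
import math

def online_vs_offline_logloss(xs):
    """On-line density estimation with a unit-variance Gaussian: predict each
    example with the running mean of the past ones, incur the negative
    log-likelihood, and compare with the offline best parameter (the overall
    mean). The returned difference is the relative (regret) loss."""
    def nll(x, mu):
        return 0.5 * math.log(2 * math.pi) + 0.5 * (x - mu) ** 2
    online, mu, n = 0.0, 0.0, 0
    for x in xs:
        online += nll(x, mu)       # loss w.r.t. the past parameter
        n += 1
        mu += (x - mu) / n         # update: average of the examples so far
    best = sum(xs) / len(xs)
    offline = sum(nll(x, best) for x in xs)
    return online - offline
```

The relative loss is always non-negative here, since the offline mean minimizes the total negative log-likelihood; the bounds in the paper control how large it can get on an arbitrary sequence.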
Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
Tsuda, Koji, Rätsch, Gunnar, Warmuth, Manfred K.
We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: Online learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.
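The update's structure can be sketched by working in log-space: since $W_{t+1} \propto \exp(\log W_t - \eta \nabla L)$, one can accumulate the log-parameter and only ever apply a matrix exponential (a minimal sketch assuming a dense-gradient step; the truncated Taylor series for the exponential is a simplification suitable only for small matrices, and both function names are hypothetical):

```python
def mat_exp(S, terms=30):
    """Matrix exponential of a small square matrix via its Taylor series."""
    n = len(S)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[sum(term[i][l] * S[l][j] for l in range(n)) / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def meg_step(S, grad, eta):
    """One Matrix Exponentiated Gradient step kept in log-space: update
    S <- S - eta * grad, then W = exp(S) / trace(exp(S)), which is symmetric
    positive definite with trace one by construction."""
    n = len(S)
    S = [[S[i][j] - eta * grad[i][j] for j in range(n)] for i in range(n)]
    E = mat_exp(S)
    tr = sum(E[i][i] for i in range(n))
    W = [[E[i][j] / tr for j in range(n)] for i in range(n)]
    return S, W
```

For diagonal matrices this reduces exactly to the vector Exponentiated Gradient update on a probability vector, matching the generalization described above.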